-
Notifications
You must be signed in to change notification settings - Fork 41
feat: zarr3 #220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: zarr3 #220
Conversation
8371f18
to
07edf73
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #220 +/- ##
==========================================
+ Coverage 72.96% 73.85% +0.88%
==========================================
Files 10 10
Lines 825 872 +47
==========================================
+ Hits 602 644 +42
- Misses 223 228 +5 ☔ View full report in Codecov by Sentry. |
7c0f187
to
1459e83
Compare
@floriankrb what was the conclusion in terms of adding/updating to zarr 3? |
@anaprietonem I updated the description of the PR |
Just adding a small comment here. You've probably considered this already but here we go. The transition to Zarr v3 specification comes with a nice opportunity: we could use the sharding feature! It would allow us to have chunked variables (avoiding having to load them all in memory before taking a subset) while still retaining very high read speeds. Basically we would make each timestamp a shard, and then chunk variables inside the shard. The downside is slower write speed, but it’s not very important. More info: |
All tests are now passing with zarr2 and zarr3. At the moment of writing, zarr3 still does not support datetime64 (zarr-developers/zarr-python#2616) Tests are much more slower with zarr3, this will require more investigation. For example The profiler shows for zarr2:
and for zarr3:
The test uses an in-memory zarr store. The test calls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@floriankrb @b8raoult this is fantastic, thank you for the hard work! We will try it this week or next and get back to you.
|
||
def __delitem__(self, key: str) -> None: | ||
"""Prevent deletion of items.""" | ||
raise NotImplementedError() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this operation is illegale, it is not a not implementer
error
"""Prevent deletion of items.""" | ||
raise NotImplementedError() | ||
|
||
def __setitem__(self, key: str, value: bytes) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
def __getitem__(self, key: str) -> bytes: | ||
"""Retrieve an item from the store and print debug information.""" | ||
# print() | ||
print("GET", key, self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am putting linter rules in wegen to disable all print statements
from typing import Any | ||
from typing import Optional | ||
|
||
import zarr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you could move this import into create_array
and put a version check there (you already have it in zarr_2_or_3)
def create_array(zarr_root, *args, **kwargs): | ||
if "compressor" in kwargs and kwargs["compressor"] is None: | ||
# compressor is deprecated, use compressors instead | ||
kwargs.pop("compressor") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no need for checking if the key is there. you can do kwargs.pop("x", None)
import numpy as np | ||
|
||
if dtype == "datetime64[s]": | ||
dtype = np.dtype("int64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.int64?
else: | ||
LOG.warning("⚠️" * 80) | ||
LOG.warning( | ||
f"Only Zarr version 2 is supported when creating datasets, found version: {zarr.__version__}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not support writing in zarr3 format for the time being, only reading. Is there a use case for it?
Description
Anemoi-datasets should be agnostic to the version of zarr. It should run with zarr3 installed or with zarr2 installed.
We should also take into account that zarr2 code cannot read datasets created by zarr3. So, ideally, we should
1 - update anemoi-datasets to work with zarr2 and zarr3
2 - keep using zarr2 to build datasets (dependency
anemoi-datasets[create]
on zarr<=2, but have zarr2 or 3 when reading (dependency ofanemoi-datasets
on zarr3 - when user have updated their environment (6 months?), start building zarr3 datasets
This PR mostly addresses the first point, making anemoi-datasets detect the version of zarr and adapt to it.
What is still missing is:
📚 Documentation preview 📚: https://anemoi-datasets--220.org.readthedocs.build/en/220/